I don’t have much experience with wine. However, I did notice that the dataset has a feature called “quality”, so my analysis will revolve around finding out how the other variables correlate with quality.
We’ll go through the summary of the dataset and a few univariate plots to understand the structure of the data.
Before doing a deep dive into the data, I pull up a summary of every variable. This gives me, at a glance, the minimum, quartiles, median, mean, and maximum of each one. This will come in handy when I create univariate plots for these variables, because it lets me choose parameters such as the bin width.
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
nrow(winedata)
## [1] 1599
ncol(winedata)
## [1] 13
We observe the following:
Observations:
I’ll first look at a few numeric features and their distributions. There are no categorical features. I will not remove any outliers in the univariate analysis, because I can’t yet tell whether they are errors or a genuine result of how the data was collected. For example, the distribution of quality is lopsided: as the summary above shows, the middle half of the wines score 5 or 6 on a scale that runs from 3 to 8.
The bin width has to be set by looking at the min/max values. I’ll keep referring to the summary statistics to choose either the number of bins or the bin width, whichever is more convenient.
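A minimal sketch of that calculation in R, using minimum and maximum values copied from the summary above. The helper name `bin_width` and the bin counts are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: width of each bin when a range is split into n_bins.
bin_width <- function(min_value, max_value, n_bins) {
  (max_value - min_value) / n_bins
}

# Min/max values copied from the summary output above.
alcohol_bw <- bin_width(8.40, 14.90, n_bins = 20)  # 20 bins is an assumed choice
ph_bw      <- bin_width(2.74, 4.01, n_bins = 30)   # 30 bins over the observed pH range

print(alcohol_bw)
print(ph_bw)
```

If the histograms are drawn with ggplot2, a value like this could be passed as the `binwidth` argument of `geom_histogram()`, or the bin count could be passed directly as `bins`.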
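The pH scale runs from 0 to 14. I’ve chosen 30 bins so that I get roughly a 0.5 resolution across the full scale. In this dataset the pH only spans 2.74 to 4.01, so over the observed range each of the 30 bins is about 0.04 wide.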
I iterated on the number of bins from 10 to 20 to 40, till I was satisfied with the visual resolution.
There are 1599 observations across 13 columns. One of them, X, is just a row index (it runs from 1 to 1599), which leaves 12 features. As mentioned on the website, there are no missing values for any of the features.
All the fields are numeric. Quality is the only field that’s human-defined; the rest are chemical properties of the wine.
Apart from citric acid, all the other variables were roughly normally distributed, some with a positive skew. I saw an unusually long tail of outliers for residual sugar and chlorides.
The main feature of interest was the “quality”, since it was human-determined. There are no categorical variables of interest, so all numeric variables with normal distribution look about the same in a univariate analysis.
To find out which numeric features matter, we’ll need a bivariate analysis of each numeric feature against quality. The correlations among the numeric features can also be computed to detect redundancy.
The dataset description mentions that quality is the median of at least 3 evaluations made by wine experts. It’d be interesting to know how much the scores for the same wine differed between evaluations. A difference of more than 1 point in either direction might be worth investigating as a sign of evaluator bias.
No, I did not.
I did not transform any of the data. Although the Udacity video mentioned that “normalizing” data is required to apply linear regression, that is incorrect. The predictors can follow any distribution; it is the errors (residuals) that are assumed to be normally distributed.
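To illustrate the point, here is a small simulated sketch. The data are synthetic and the variable names are hypothetical; nothing here comes from the wine dataset. The predictor is strongly right-skewed, yet ordinary least squares still recovers the slope, because the noise added to the response is normal.

```r
set.seed(42)

n <- 1000
x <- rexp(n, rate = 1)               # right-skewed predictor, deliberately not normal
noise <- rnorm(n, mean = 0, sd = 1)  # normally distributed errors
y <- 3 + 2 * x + noise               # true intercept 3, true slope 2

fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

print(slope)  # should land close to the true slope of 2

# Shapiro-Wilk test on the residuals: a large p-value means there is
# no evidence against normality of the errors.
print(shapiro.test(residuals(fit))$p.value)
```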
The only unusual distribution that I saw was for citric acid, which did not follow a normal pattern.
I’ll first compute the correlations between all variables. This will help me pick the right pairs to inspect. I’ll only look at correlations whose absolute value is at least 0.25. Weaker correlations can also tell us a lot about the dataset, but I’m confining my investigation to the major features.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
Below, I enumerate all the correlations that I’ll explore later (absolute value >= 0.25).
Observations on quality:
Observations on fixed acidity:
Observations on volatile acidity:
Observations on citric acid:
Observations on residual sugar:
Observations on chlorides:
Observations on free sulphur dioxide:
Observations on total sulphur dioxide:
Observations on density:
Observations on pH:
Observations on sulphates:
Observations on alcohol:
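The filtering step can be sketched in R as follows. To keep the example self-contained, it rebuilds a five-variable subset of the correlation matrix by hand from the values printed above; with the full data the matrix would presumably come from something like `cor(winedata)`. The object names `m`, `idx`, and `strong` are illustrative.

```r
vars <- c("volatile.acidity", "pH", "sulphates", "alcohol", "quality")
m <- diag(1, length(vars))
dimnames(m) <- list(vars, vars)

# Upper-triangle values copied from the correlation output above.
m["volatile.acidity", "pH"]        <-  0.23493729
m["volatile.acidity", "sulphates"] <- -0.260986685
m["volatile.acidity", "alcohol"]   <- -0.20228803
m["volatile.acidity", "quality"]   <- -0.39055778
m["pH", "sulphates"]               <- -0.19664760
m["pH", "alcohol"]                 <-  0.20563251
m["pH", "quality"]                 <- -0.05773139
m["sulphates", "alcohol"]          <-  0.09359475
m["sulphates", "quality"]          <-  0.251397079
m["alcohol", "quality"]            <-  0.47616632

# Mirror the upper triangle into the lower one to make the matrix symmetric.
m[lower.tri(m)] <- t(m)[lower.tri(m)]

# Keep each pair once (upper triangle only) where |r| >= 0.25.
idx <- which(abs(m) >= 0.25 & upper.tri(m), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)
strong <- strong[order(-abs(strong$r)), ]  # strongest pairs first
print(strong)
```

Restricting the search to the upper triangle avoids listing every pair twice and skips the diagonal of 1s.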
Let’s take a look at the features that correlate with quality, namely volatile acidity, sulphates, and alcohol.
It’s difficult to decipher the above scatterplot. While we know that there’s a correlation, a boxplot might give us more insights.
I see three acidity-related features: “fixed.acidity”, “volatile.acidity”, and “citric.acid”. We also know their correlations with each other. Let’s visualize them.
Observations:
Let’s look at a boxplot for a better view of the correlation.
In order to make the boxplot feasible, I multiply the pH by 4 and take the ceiling. This gives me an integer, which I then divide by 4 again. In essence, pH gets grouped into intervals of 0.25, which lets me draw a boxplot by treating the pH groups as a factor.
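The grouping step can be sketched as below. The five pH values are the minimum, quartiles, and maximum from the summary earlier, used here only as sample input; the variable names are illustrative.

```r
# pH values taken from the summary output (min, 1st Qu., median, 3rd Qu., max).
ph <- c(2.74, 3.21, 3.31, 3.40, 4.01)

# Multiply by 4, round up, divide by 4: each value maps to the upper
# edge of its 0.25-wide interval.
ph_group <- ceiling(ph * 4) / 4
print(ph_group)

# Treat the groups as a factor so they can serve as boxplot categories.
ph_factor <- factor(ph_group)
print(levels(ph_factor))
```

With the full data this would be something like `winedata$pH.group <- factor(ceiling(winedata$pH * 4) / 4)`, where the column name `pH.group` is hypothetical.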
Observations:
For some reason, volatile acidity has a positive correlation with pH.
For quality score 6, the anomalous relationship between pH and acidity is even more apparent.
Observations:
Since quality is the only variable that can be treated as categorical, all the multivariate plots will have to be faceted around it. We’ll explore some of the interesting relationships from the bivariate section, but faceted on quality.
Observations:
Let’s explore the seemingly anomalous relationship in detail. In the bivariate section, I had hypothesized that fixed acidity is the lurking variable in the relationship between pH and volatile acidity.
Observations:
Volatile acidity, sulphates, and alcohol had the strongest correlations with quality (about -0.39, 0.25, and 0.48 respectively). We’ll examine the relationships between these three variables, faceted by quality.
Observations:
I’m choosing the 3 plots that represent a strong and pertinent relationship.
This is a very important plot, because it shows the relationship between the quality of the wine and its alcohol content. It shows that higher alcohol content is correlated with higher quality (a correlation of about 0.48).
If we were designing a model, I’d put alcohol, volatile acidity, and sulphates in as the main features and check what accuracy I get, before pulling in other features.
Earlier, I saw an interesting relationship between volatile acidity and pH. Generally, lower pH suggests higher acidity. However, pH and volatile acidity had a positive correlation, which didn’t make sense at first.
Also, fixed acidity has a strong negative correlation with pH, as expected. My theory was that the observed relationship between volatile acidity and pH was driven by a third variable (fixed acidity), and that the two weren’t really directly correlated.
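One quick numerical check of this theory is a partial correlation, a technique the analysis itself did not use and which is introduced here only as a sketch. It takes the three pairwise correlations from the matrix above and asks how much of the volatile acidity vs. pH correlation remains once fixed acidity is held constant. The variable names are illustrative.

```r
# Pairwise correlations copied from the correlation matrix above.
r_va_ph <-  0.23493729  # volatile.acidity vs pH
r_va_fa <- -0.25613089  # volatile.acidity vs fixed.acidity
r_ph_fa <- -0.68297819  # pH vs fixed.acidity

# First-order partial correlation of volatile.acidity and pH,
# holding fixed.acidity constant.
partial_r <- (r_va_ph - r_va_fa * r_ph_fa) /
  sqrt((1 - r_va_fa^2) * (1 - r_ph_fa^2))

print(round(partial_r, 3))
```

The partial correlation comes out at roughly 0.085, well below the raw 0.235, which is consistent with fixed acidity accounting for most of the observed association.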
This is one of the plots that shows why the data looks the way it does. Across different pH groups, the correlation between volatile and fixed acidity is sometimes positive and sometimes negative. This suggests that volatile acidity doesn’t really depend on fixed acidity; otherwise we would have seen a correlation in the same direction for every pH group. In other words, pH is not really tracking volatile acidity, and the way the dataset is constructed produces a red-herring correlation between the two.
I selected this plot because when we go ahead and train a regression model to predict quality, we’d want our predictor variables to be as uncorrelated with each other as possible. Otherwise there is a certain degree of redundancy, and every extra feature dimension requires more data to maintain accuracy.
In the case of fixed acidity and citric acid, I see an almost linear relationship (a correlation of about 0.67). In such cases, instead of having both of them as features, I’d just have one feature that’s the ratio of the two, or just pick one of the two. This way, the regressor has to deal with one less feature.
As we saw above, the same is not true for fixed and volatile acidity, so I’d leave them separate and untouched.
Plots like this allow us to make important decisions on reducing dimensionality, checking whether PCA would help, and so on. Basically, with fewer dimensions and less redundancy, our model will be able to learn better from the same amount of data.
When I started with the exploration, the first thing that I noticed was that all the features were numeric. The target variable, quality, was the only one that could be treated as categorical.
I had to adjust my mindset to this dataset, because having categorical features helps with plotting interesting multivariate plots. Because the features were numerical, I had to come up with artificial factors by grouping continuous variables into discrete intervals. I did this in some areas where my domain knowledge helped me get the right interval, for example, the pH. Quality was the other variable that I used to do multivariate plots.
The target that I set for myself was to find out what influences the quality. The description on the website that I obtained the dataset from said that the quality was determined from evaluations by at least 3 wine experts. However, it would have been immensely informative for me to have those scores separately, rather than just the mean or median.
In the univariate section, I was more interested in getting to know the structure of the data. I started with quality, so that I knew the “class distribution.” As I expected, it turned out to resemble a normal distribution. Knowing this was critical, because it influenced how I interpreted the other univariate distributions.
For example, if I ever saw a normal distribution for another feature, I took it as a candidate that could predict the quality, because its distribution matched that of the quality, and could thus have differentiating information encoded in it. Of course, correlation was used to confirm to what degree a variable affected quality, but I’m just focusing on the visual aspects.
Univariate distributions that were uniform or partly uniform might not have helped much in predicting wine quality, except for a subset of target classes (they’d only have helped with identifying, say, higher-quality wine, while being uniform for other wines).
The bivariate analysis was an interesting opportunity to learn about interrelationships between variables. Since there were a lot of combinations of variables, I started out by calculating correlations between all of them, so that I could narrow down on a few combinations that I’d end up visualizing.
This is where I investigated strong relationships of features with the quality. It was also a good way of finding redundancies between features (high correlation). When modelling, we deal with the curse of dimensionality, and bivariate plots help us confirm if we can reduce the dimensionality of features by either eliminating the redundant ones, or pulling in linear combinations of variables instead of having them separately.
I also started investigating the relationship between volatile acidity and pH. My initial assumption was that they should be negatively correlated, but I was wrong about how pH and volatile acidity are each measured. I drew a couple of plots around it.
The multivariate section was more of a way to reaffirm observations from the bivariate section. I only faceted on quality (which is the only categorical variable) and pH (because I had the domain knowledge to know that grouping intervals of pH could lead to interesting plots).
I selected the final plots to bring out correlation with quality, which was our target variable. But instead of just fixating on quality, I also picked out plots that showed strong correlation, hence redundancy, in the data. Since the volatile acidity vs. pH was initially confusing due to my assumption, I included that plot as well to demonstrate analysis that required several graphs to understand.